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System and Method for 
Cross Correlation 
with Application to 
Video Motion Vector Estimation 



10 



Field of the Invention 

Hie field of this invention is Digital Signal Processing (DSP) and, more specifi- 
cally, the field of Cross Coirelation computation with application in image pro- 
cessing and video motion vector estimation. Application of this invention is also 
15 suitable for 2-dimensional DSP such as pattern recognition, signature analysis, 
and feature extraction. 



Background of the Invention 

One application of video data compression is in a video teleconferencing system, 
and a simple block diagram of such a system is shown in Figure 1. A video tele- 
20 conferencing system allows two or more remote parties to both see and talk to 
ichthyosaui: Each party has: a video camera 1; a video encoder 2; a video 
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decoder3; a video display 4; a microphone 5; an audio encoder 6; an audio 
decoder 7; and an audio speaker 8. Encoded video and audio is transmitted 
across a data channel 9. 

Most video data compression applications, of which video teleconferencing is an 
5 example, have either or both a video encoder 2 and a video decoder 3. Example 
schemes for a video encoder 2 and a video decoder 3 are shown in Figure 2 and 
Figure 3, respectively. 

Efficient video compression involves the combination of transform coding and 
predictive coding. This type of compression scheme is sometimes called full 

10 motion compression and has being incorporated into, for example, the Consulta- 
tive Committee for International Telegraphy and Telephony (CCITT) R261 
standard In this scheme, a video encoder 2 predicts 19 the current picture frame 
using motion vectors 21 and the previous picture frame stored in the frame 
memory 23, then computes the error 25 of the predicted picture frame 19 when 

15 compared to the current picture frame 27. Motion vectors 21 are generated in the 
motion vector estimator 28, which compares the current frame 27 with the previ- 
ous frame (stored in frame memory 23) and finds a "best" match, generally on a 
block-by-block basis. The encoder then codes the prediction error using a Dis- 
crete Cosine Transform (DCT) based transform 29, quantizes the data in a 

20 quantizer (Q) 31, and sends it along with the motion vectors 21 over a transmis- 
sion channel 9 after sending them through a variable length coder (VLQ 33. At 
the receiving end, the decoder 3 uses the motion vectors 21 and the previous pic- 
ture frame stored in the frame memory 35 to predict the current picture frame 37 
and then adds the prediction error from the inverse DUT (JDCT) 39 to the pie- 

25 dieted frame before displaying it 41. 

The description of a video encoder 2 and a video decoder 3 is very simple and 
generalized, and is included here to show the critical role played by motion vec- 
tors in video compression applications. The process of motion vector estimation 
is an essential and computationally expensive part of inter-frame predictive cod- 
30 ing which is needed to achieve high compression ratios. The present patent pre- 
sents a system and method for cross correlation, which can be used for motion 
vector estimation 28 and, therefore, be included in a video encoder 2. 

A picture frame is made of many blocks of pixels, commonly called macro 
blocks. For example, in the Common Interchange Format (OF), the picture 

35 frame has 22x18 macro blocks, where, each macro block consists of 16x16 
pixels. In inter-frame encoding mode, a motion vector for each macro block must 
be estimated and sent along with the encoded prediction error for the macro 
block. Motion vector estimation is done by comparing a macro block in the 
present frame to a search block in the previous frame and finding a "best" match. 

40 The motion vector is estimated to be the offset between the position of the macro 
block in the present frame and the location of the "best" matching area within the 
search block. Typically, the search block is restricted to 32x32 pixels and is 
centered at the same location within the picture frame as the macro block. The 
"best" match is a subject to the cost function used for evaluating the various 

45 choices, and a variety of cost functions have been discussed in the literature. In 
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the book, "Discrete Cosine Transform Algorithms, Advantages, Applications", 
by K- R. Rao and R Yip, published by Academic Press, Inc. San Diego, CA, 
1990, cost functions arc described on pages 242 to 247. The commonly used cost 
function is sum of absolute differences, and is mathematically expressed as 
5 Equation 1. 

(EQ1) 

1 15 15 

mae(*>"0 = i6 X i6 X X + /+"0| n,m = 0,...,l 

i = 0 / = o 

In Equation 1, jc (/, /) with i, /= 0, 15 is the macro block in the current pic- 
ture frame, and y (i, /) with i, /= 0, 31 is the search block in the previous 
frame. For each motion vector estimate, a total of 17x17 =289 two-dimen- 
sional summations are computed, and the smallest value determines the "best" 

10 motion vector estimate. This algorithm is sometimes referred to as the mean 
absolute error (MAE) algorithm. The MAE algorithm has been implemented in 
an integrated circuit, the L64720 Video Motion Estimation Processor (MEP) 
from LSI Logic Corporation, Mlpitas, CA. The computation for the MAE is rel- 
atively simple, and the algorithm is effective in providing motion vector esti- 

15 mates for video transmission where high data rate, and hence low compression 
ratio, is allowed. For example, in video conferencing using transmission lines 
with 128 kbits per second or more, this is quite adequate. However, for lower 
data rate transmissions which require higher compression ratios, a more accurate 
motion vector estimation process is required since it provides better prediction 

20 and hence, reduction of the error which is transmitted or stored in inter-frame 
coding. 

An alternative, but better, cost function is the cross correlation function (CCF). 
The CCF is a more powerful cost function for determining data block matches, 
however it has not been implemented in any real time systems due to its compu- 
25 tational complexity. The CCF is given by Equation 2. 

CEQ2) 

15 15 15 15 

2 5>0'.0y('+n,/+m)- £ X y(' + n,/+iw) 

p5 ~H 3 Mi rs "i 



Again, a total of 289 values must be computed, and the largest one determines 
the "best" motion vector estimate. Clearly, the computation of the CCF involves 
multiplication and is, therefore, more computationally expensive than MAE, 
30 which needs only addition and absolute value. The CCF, as defined above in 
Equation 2, also requires square root, which can be avoided by squaring the 
entire expression. The squared version of CCF, called CCF2, is shown in 
Equation 3. 
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<EQ3) 

[15 15 15 15 -i* 
X X*tt07(«>«.'+») - X *fc« X >(i>«,/+m) 
_ = 0 1=0 f./=0 u°o J 

~F~i5 713 .2-1 r- 15 715 72 

X ( X^ 0 x^ ,>n - /+m >-( X^ (/+n - /+m) ) 

The CCF or CCF2 approach can estimate the match more precisely, and hence, a 
more accurate motion vector can be estimated. This is crucial if a higher data 
compression ratio is to be achieved. For example, it is especially important to 
5 achieve a higher compression ratio when a standard analog telephone line, 
which, at present, supports up to only 192 kilobits per second, is used to transmit 
video signals. 

An additional variation of the CCF, which can be used for motion vector estima- 
tion, is the cross covariance, CCV, as shown in Equation 4. The CCV is simply 
. 10 the numerator of the CCF, thus eliminating the need to compute 

and its square roots. Similarly, the largest value of CCV determines the "best" 
match. 

15 15 15 15 

CccvOvm) = X X x ft37tf+*l+m) - X X K'+M+m) (EQ4) 

Another approach to motion vector estimation is the normalized mean square 
IS error; NMSE, which is shown in Equation 5, and the smallest value determines 
the "best" match. 

15 15 15 15 

XVao+ X /a+M+»o-2X X*tt J >J<' + * l+m > 

C NMSE (n,m)=ii^ Si=L B <fUL^> (EQ5) 

X ^fto 

The NMSE can be simplified by eliminating its denominator to form the mean 
square error approach, MSE, shown as Equation 6, and the smallest value deter- 
20 mines the "best" match. 

15 15 15 15 

The present invention describes a method and apparatus that efficiently computes 
the computationally expensive summation common to Equation 2 through 
Equation 6, which is shown in Equation 7, and hereinafter called simply the 
25 "cross correlation*'. Cross correlation can itself be used as a cost function for 
motion vector estimation* 

15 15 

c(n,m) = £ Jx(l,i)3f(t+n,/+m) n,m = 0, 16 (EQ7) 
i = 0 / = o 

Direct calculation of the cross correlation requires 256 multiplies and 255 adds 
for each resulting value. The innovative method presented here converts the two 
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blocks, both the macro block and the search block, to their frequency domain 
representation via the Discrete Fourier Transform (DFT) and then multiplies the 
two blocks together to form a product matrix in the frequency domain. The fre- 
quency domain result is then converted back to spatial domain and the 289 cross 
5 correlation values are simultaneously available to be searched for the maximum 
value. Hie present patent describes a complete method for this computation and 
includes several particularly innovative "tricks" which make the method attrac- 
tive for implementation in hardware. 

In the detailed description of the invention, the method is illustrated graphically 
10 and each stage is precisely described. Since various block sizes can be chosen 
and also since cross correlation for block matching has applications in many 
areas, such as pattern recognition, signature analysis, and feature extraction, the 
dimensions in the figures and description are kept in general terms. 

In this Digital Signal Processing (DSP) application, namely cross coirelation 
15 computation, there is a need to perform numerical data processing at a very high 
rate. In many DSP architectures, high data rate processing is achieved through 
the use of multiple Arithmetic Units (AU) such as combinations of adders, multi- 
pliers, accumulators, dividers, shifters, and multiplexors. However, there are two 
major difficulties with using multiple AUs in parallel: first, many control signals 
20 are needed to control multiple AUs; and, second, it is difficult to get the correct 
data words to each AU input on every clock cycle. 

Some DSP architectures are Single Instruction Multiple Data (SIMD) architec- 
tures. A SIMD architecture, as defined here, groups its AUs into identical Pro- 
cessors, and these Processors perform the same operation on every clock cycle, 

25 except on different data. Hence, a Processor within a SIMD architecture can have 
multiple AUs, with each AU operating in bit parallel fashion (some definitions of 
SIMD architectures have AUs operating in bit serial fashion, which is not the 
case here). The SIMD architecture is applicable to video image compression 
because images are split into identically sized blocks which can be processed in 

30 parallel If there are four Processors, then four blocks within the image can be 
processed in parallel, and the Processors are controlled in parallel with the same 
instruction stream. 

Many DSP applications, especially real-time applications such as video com- 
pression, perform the same set of operations over and over. In video compres- 

35 sion, successive images are compressed continuously with the same set of 
operations. Within the complex function of video compression, there are simpler 
functions such as convolution, Fourier transform, and correlation, to name only a 
few. These simpler functions can be viewed as subroutines of the complex func- 
tion of video compression. These simple functions can be broken down further 

40 into elementary subroutines; for example, the 2-dimensional Discrete Fourier 
Transform (DFT) of a 32x32 data matrix can be implemented with 64 calls to 
an elementary subroutine which preforms an 32-point DFT. The cross correlation 
method of the present patent includes a SIMD architecture which efficiently per- 
forms a wide variety of elementary subroutines, including those needed for cross 

45 correlation computation. 
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Brief Description of the Drawings 

Rgure 1: Video teleconferencing system* 

Figure 2: Video encoder. 

Figure 3: Video decoder. 

5 Rgure 4: Overview of the cross correlation method for motion vector estimation. 

Rgure 5: Relative sizes of the search block, data block, and cross correlation for 
motion vector estimation. 

Rgure 6: Cross correlation method for estimating motion vectors for two pairs of 
data blocks. 

10 Rgure 7: Cross correlation method for estimating motion vectors for one pair of 
data block. 

Rgure 8: Three stage Processor architecture for computing SLTs. This is prior art 
for the present patent 

Rgure 9: General block diagram of a Two Stage Processor utilizing two multi- 
15 port memories. Arithmetic Units (AUs) in the Top Stage get their operands from 
the Top Multiport RAM and the Arithmetic Units in the Bottom Stage get their 
operands from the Bottom Multipart RAM. AU outputs are both stored back into 
the same stage's multiport RAM and passed either to the next stage or the Output 
Bus. 

20 Rgure 10: SIMD architecture utilizing parallel Two Stage Processors. 

Rgure 11: A particular Two Stage Processor architecture utilizing parallel multi- 
port memories. The Top Stage consists of a four-port RAM and an adder The 
Bottom Stage consists of a six-port RAM, a multiply-accumulator, and an adder 

Rgure 12: Two Stage Processor Architecture similar to that of Figure 11 except 
25 the number of Arithmetic Units has been doubled in order to allow direct manip- 
ulation of complex numbers. 

Rgure 13: Two Stage Processor Architecture similar to that of Figure 12 except a 
Crossbar Multiplexor has been inserted after the all output buses for both multi- 
port memories. These Crossbar Multiplexors allow swapping the real and imagi- 
30 nary parts of complex numbers before they are input to an Arithmetic Unit 

figure 14: Two Stage Processor Architecture similar to that of Rgure 13 except 
only one multiply-accumulator is included. If complex numbers are processed, 
the single multiply-accumulator; real and imaginary parts are multiplied sequen- 
tially. 

Rgure 15: Two Stage Processor Architecture similar to that of Rgure 14 except 
35 the six-port RAM has implemented with two separate RAM arrays: one RAM 
array for the real part and one RAM array for the imaginary part 

figure 16: Two Stage Processor Architecture similar to that of Figure 15 except a 
shifter has been added to the input of Arithmetic Units. 
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Rgure 17: Two Stage Processor Architecture similar to that of Figure 13 except a 
crossbar multiplexor and a shifter has been added to the input of Arithmetic 
Units. 

Overview of the Invention 

5 This invention describes two efficient methods for computing cross correlation 
of two 2-dimensional data blocks. First, step-by-step operations represented by a 
set of equations on input data blocks are presented. Though the intended applica- 
tion is motion vector estimation in video compression applications, the method is 
suitable for many applications. Therefore, the dimensions of input data blocks 

10 are arbitrary. Since several key operational steps within the method are Discrete 
Fourier Transforms (DFIfc), the description goes on to introduce a set of DSP 
architectures which compute a special class of DFT algorithms as well as the 
other steps required to complete the entire method. 

Computation of a cross correlation of two linear, that is, one dimensional, 
15 sequences is essentially a convolution process. It has long been known that con- 
volutions can be performed by fast convolution techniques which utilize DFTs. 
In essence, a convolution can be performed by performing a Fourier transform 
on the two data records, multiplying the transform results together, and then per- 
forming an inverse Fourier transform on the product to get the final result In the 
20 book, "Fast Fourier Transform and Convolution Algorithms" by 
H. J, Nussbaumer, second edition published by Springer- Verlag in 1982, many 
fast convolution algorithms are discussed. The present patent utilizes the funda- 
mental concept of convolution by Fourier transform and expands it to two 
dimensions and adds the necessary additional steps to perform two simultaneous 
25 2-dimensional cross correlations. 

Figure 4 is an overview of the method of the present patent as it is applied to 
motion vector estimation. One or more data blocks 51,53 within the present 
frame 27 are selected and converted 55 to form a matrix of complex data. Next, 
the complex matrix is transformed 57 to the frequency domain to form the fre- 
quency domain representations of the selected data blocks 51^3, A set of search 

30 blocks 59,61 within the previous frame 63, with a one-to-one correspondence to 
the selected data blocks 51,53, is also selected, converted 65 into a complex data 
matrix, and transformed 67 to the frequency domain to form the frequency 
domain representations of the selected search blocks 59,61. Once in the fre- 
quency domain, the two sets of data are multiplied 69 together, then inverse 

35 transformed 71 to get back to the spatial domain. The data then passes through an 
adjustment process 73 to form the cross correlations 75,77 between the pairs 
(pairs in the diagram are: 51 and 59; 53 and 61) of data blocks 51,53 and search 
blocks 59,61. 

Figure 5 is a representation of the relative sizes of the search block 59, data 
40 block 51, and cross correlation 75. For example, this figure can be viewed as a 
pictorial representation of Equation 7 in which M x = M 2 = 32, 
N x = N 2 = 16, Z>! = D 2 = 16, and y(i + n,/+m) replaced with 
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y (i + n-8,/ + m-8) where n,m = -8, 8- The data blockSl is always 
smaller than the search block 59. The size of each dimension of the cross 
correlation 75 is less than or equal to one plus the difference of the corresponding 
dimensions of the search block 59 and the data block 51. For optimal use of com- 

5 putation in the method of the present patent, the size of each dimension of the 
cross correlation 75 is exactly equal to one plus the difference of the correspond- 
ing dimensions of the search block 59 and the data block 5L In essence, one 
cross correlation 75 value places the data block 51 at a particular location within 
the search block 59, multiplies the corresponding entries in the two blocks 51,59 

10 together (that is, those values on top of each other in Figure 5), and sums these 
products. For other cross correlation 75 values, different locations of the data 
block 51 within the search block 59 are chosen. The cross correlation 75 is a 
matrix of values which contains all the possible overlaps of the data block 51 
within the search block 59. The method of the present patent includes the step 

15 "convert to complex" 55 for the data block 51 which not only makes complex 
data values, but also pads the resulting data matrix with zeros so that the result- 
ing matrix is the same size as that generated by the "convert to complex" 65 step 
for the search block 59. This step allows to compute spatial correlation using 
DFT and frequency domain multiplication. The method of the present patent suc- 

20 cessfiilly avoids padding the search block 59 with zeros by only keeping the 
cross correlation 75 values which are of interest (that is, I\+ lxD 2 +l values) 
even though more values can be calculated (that is, M i x2V 2 values). For motion 
vector estimation, the cross correlation 75 values are searched for the largest 
value. If other approaches, such as CCF, CCF2, CCV, NMSE, or MSE, are used, 

25 additional computation is performed on the cross correlation 75 values before 
they are searched. 

The DFT most naturally operates on, and generates, complex data. The classic 
example of this is the standard Fast Fourier Transform (FFT) algorithm which 
uses successive stages of radix-2 "butterflies" for computing transforms of length 
30 2 fl , where a is an integer: However, the applications of interest here do not start 
with complex data, they start with real data. There exist DFT algorithms for real- 
only data, but they are not as computationally efficient as complex DFT algo- 
rithms. 

The two methods, Method A 100 and Method B 101, that take the advantage of 
35 the DFT property of operating on complex data are presented below. Method 
A 100 takes two pairs of real input blocks and computes two cross correlation 
simultaneously; whereas Method B 101 takes one pair of real input blocks and 
computes one cross correlation. While both methods can be performed on the 
same hardware architecture, Method A 100 contains more repetitions of elemen- 
40 tary subroutines and Method B 101 requires less cache memory. 

Hgure 6 is a pictorial representation of Method A 100 of the invention of the 
present patent In this section of the patent, a general textual description of the 
method is given, with a following mathematical description under the detailed 
description section. Each box in Figure 6 represents a block of data, and arrows 
45 represent processing steps performed on the blocks of data. Step 1 102 takes two 
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input blocks 103,105 of real data (two search blocks for motion vector estima- 
tion) and combines them into a single block 107 of complex data by making the 
first input block 103 the real part and the second input block 105 the imaginary 
part This is an innovative part of the method which allows performing two 

5 simultaneous 2-dimensional cross correlations. In Step 2 109, the resulting 
block 107' of Step 7 102 has DFTs performed along each of its rows, resulting in 
a new block of data 11L For Step 2 109, a special type of DFT can be performed, 
namely the Short Length Transform (SLT), as described on pages 144 to 150 of 
Nussbaumer's boot The SLT minimizes the number of operations, especially 

10 multiplies, but requires more complicated hardware architectures and controls to 
realize it for real time applications. Consequently, the SLTs have not been used in 
real time cross correlation because there has not been a good hardware architec- 
ture for SLT computation; the present method solves this problem with the archi- 
tectures described later. In Step 3 113, DFTs are performed along the columns of 

15 the block HI resulting from Step 2 109; and an SLT can also be used here. In 
Step 4 115, the resultant block 117 of Step 3 113 is split into two blocks 119,121, 
where these two blocks 119,121 are each of conjugate symmetry and are the fre- 
quency domain representation of the two real input data blocks 103,105. In other 
words, the two blocks 119,121 are made up of pairs of complex numbers that are 

20 conjugate of each other. This means only half of the data points in the 
blocks 119,121 need to be generated, thus saving computation. Exact details of 
the innovative set of operations which take advantage of this symmetry are 
described in the detailed description section. 

Step 5 123 in Figure 6 takes two input blocks of real data 125,127 (two macro 

25 blocks for motion vector estimation) and combines them into one block of com- 
plex data 129 by making the first input block 125 the real part and the second 
input block 127 the imaginary part The two input blocks 125427 for Step 5 123 
are smaller than the input blocks 103,105 in Step 1 102 and, therefore, the result- 
ant block of Step 5 123 must be padded with zeros to increase it size to be the 

30 same as the resultant block from Step 1 102. Like Step 1 102, Step 5 123 allows 
the performing of two simultaneous 2-dimensional cross correlations. Step 6 131 
is a set of DFTs along the rows of the resultant block 129 of Step 5 123 and here 
again SLTs can be used However, since the resultant block 129 of Step 5 123 has 
been padded with zeros, only some of the rows need to be transformed, thus a 

35 saving in computations by reducing the number of DFTs required In addition, 
since some of the elements in these rows of the resultant block 129 are zero due 
to the zero padding of Step 5 123, special SLTs can be developed to further 
reduce the total number of computations. This new special SLT will be described 
in detail later. Step 7133 is a set of DFTs along the columns of the resultant 

40 block 135 of Step 6 131. In Step 8 137, the resultant block 139 of Step 7 133 is 
split into two blocks 141,143, where these two blocks 141443 are each of conju- 
gate symmetry and are the frequency domain representation of the two real input 
data blocks 125427 for Step 5 123. Step 8 137 has the same innovative features 
zsStep4US. 

45 Steps 1 102 through 8 137 take four input blocks 103,105,125,127 and generate 
their respective frequency domain representations 119,121441,143, Steps 9 145 
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and 10 147 each take pairs of these frequency domain blocks (pairs: 119 and 141; 
121 and 143) and multiply them together to form two new blocks 149,151. These 
two blocks 149,151 have similar conjugate symmetry property, so once again 
computations are saved because only half of each block 149,151 needs to be 
5 computed Step 11 153 merges the resultant blocks 149,151 of Steps 9 145 
and 10 147 together to form one block 155 of complex data. This innovative 
merge step generally produces a complex matrix 155 without conjugate symme- 
tries, and the mathematical operations are described later. 

Step 12 157 is a set of inverse DFTs, or IDFTs, along die columns of the resultant 
10 block 155 of Step 11 153. Step 13 159 is a set of IDFTs along fee rows of the 
resultant block 161 of Step 12 157. The SLTs are used in both Step 12 157 and 
Step 13 159. Step 14 163, the final step in generating the two desired cross corre- 
lations, is done by simply splitting the resultant block 165 of Step 13 159 into 
two blocks 167,169 by taking the real part and the imaginary part Since the 
15 resulting blocks 167,169 are smaller than the normally expected resultant 
block 165 of Step 13 159, it tons out IDFTs need only be applied to some of the 
block rows in Step 13 159, thus allowing another reduction in the total computa- 
tions. This, in turn, means some of the rows of the resultant block 161 in 
Step 12 157 are not needed, and another special SLT can be developed to further 
20 reduce die number of total computations. For motion vector estimation, the two 
resultant blocks 167,169 of Step 14 163 are each searched for their largest value, 
and the indices of these two maximum values are the two estimated motion vec- 
tors for the two macro blocks 125427 which were inputs to Step 5 123. 

In the above method, the order of performing row DFTs and column DFTs is 
25 arbitrary, and hence, can be interchanged. For example, Step 2 109 can perform 
DFTs on columns, provided Step 3 113 performs DFTs on rows. likewise, when 
only half of a data block 119,121441,143,149,151 is needed to be computed, it is 
an arbitrary choice as to which half is computed. 

Figure 7 shows the procedures for Method B 10!L In this method, only two data 
30 blocks 201,203 are input to the system and one cross correlation 205 is com- 
puted. Step 1 207 combines the search block 201 into a complex data block 209 
by making the upper half of input block 201 the real part and the lower half 201 
the imaginary part As the result, this complex block 209 only has half the size of 
the input block 20L This is an innovative part of the method that allows the use 
35 of DFT on complex data and the reduction of the number of DFTs required. In 
Step 2 211, DFTS are performed along the rows of the complex block 209. In 
Step 3 213, the resultant block 215 is then split into a new complex block 217 
that corresponds to the input search block 201 with its rows DFTed. The splitting 
done in Step 3 213 generates only half the number of columns of the block 217 
40 since the other half has the conjugate symmetry property. DFTs along the col- 
umns of the block 217 are performed in Step 4 219. The resultant block 221 is 
the frequency domain representation of the input search block 20L 

In Step 5 223, the data block 203 is mapped into a complex data block 225 in the 
way similar to Step 1 20L Since the data block 203 has smaller size, the lower 
45 center portion of resultant complex data block 225 are nonzero and the rest are 
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zero-padded to the same size as the block209. Next, row DFTs are performed on 
the complex data block 225 in Step 6 HI. In Step 7 219, the row DFTed 
block 231 is split to form a new complex data block 233 to be column DFTed. 
Similar to the block resulting from splitting in Step 3 213, this complex 
5 block 233 has conjugate symmetry and half of it are generated. In Step 8 235, 
DFTs are performed along the columns of the block 233 to yield the frequency 
domain representation 237 of the input data block 203. 

The frequency domain multipHcation is performed in Step 9 239 but only the half 
the data are computed in the resultant block 241 due to conjugate symmetry. 

10 IDFTs along the columns of the block 241 are performed in Step 10 243. Before 
performing the IDFT on the rows, the resultant block 245 must first be split into 
another complex block 247 in Step 27 249. The resultant block 247 is then 
IDFTed in Step 12 251 along its rows to result the spatial domain block 253. 
Finally, in Step 13 255, the cross correlation 205 is obtained from the spatial 

15 domain block 253, which involves no operation. 

Like Method A 100, SLTs are used to perform both DFTs and IDFTs. Since 
blocks 255,231 contain zeros in their rows and columns, special SLTs can be 
used to reduce the total number of operations. 

Many elementary subroutines utilized above (for example, DFT, IDFT, and SLT) 

20 can be described as a cascade of adds and multiplies, and architectures which are 
well matched to specific elementary subroutines can perform these elementary 
subroutines in a single pass. For example, in Nussbaumer's book, the equations 
for SLTs can be viewed as an add-multiply-accumulate-add process. That is, 
input data points are added and subtracted in various ways to form the first set of 

25 intermediate results; the first set of intermediate results are multiply-accumulated 
in various ways to form the second set of intermediate results; and the second set 
of intermediate results are added and subtracted in various ways to form the final 
results. The hardware for this three-stage process of add-subtract, followed by 
multiply-accumulate and then by add-subtract, is shown in Figure 8. The archi- 

30 tecture of Figure 8 was utilized in the SETT DSP Engine design done at Stanford 
University in 1984, and is therefore, prior art for the present patent The SETI 
DSP Engine architecture was published in four places: 1) "VLSI Processors for 
Signal Detection in SETT' by Duluk, etaL, 37th International Astronautical Con- 
gress in Innsbruck, Austria, October 4-11, 1986; 2) "Artificial Signal Detectors" 

35 by Linscott, Duluk, Burr, and Peterson, international Astronomical Union, Col- 
loquium No. 99, Lake Balaton, Hungary, June 1987; 3) "Artificial Signal Detec- 
tors", by Linscott, Duluk, Burr, and Peterson, included in the book 
"Bioastronomy - The Next Steps", edited by George Marx, Kluwer Academic 
Publishing, pages 319-335, 1988; and 4) The MCSA n - A Broadband, High 
40 Resolution, 60 Mchannel Spectrometer", by Linscott, et al. (including Duluk, an 
author of the present patent), 24th Annual Asilomar Conference on Circuits, Sys- 
tems, and Computers, November 1990. 

The architecture of Figure 8, which, is prior art, can perform three arithmetic 
operations in parallel: two add-subtracts and one multiply-accumulate. The add- 
45 subtracts are performed in the Upper Adder 301 and the Lower Adder 303; while 
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the multiply-accuinulate is performed in the Middle MuMply-Accomnlate 305 
The paraUelism of three simultaneous arithmetic operations is achieved through 
the use of mulnportRandom Access Memories (RAM), sometimes called mull 
, 1°^^ ^ multi P ort RAMs in Figure 8 can perform ten simul- 

5 taneous read or wnte operations. The Upper Four-port RAM 307 simultaneously 
performs: two read operations for the operands of the Upper Adder 301: one 

SZr°^ e ^f„^ 3 ° lieS ^ ^ one wr£ operation^* 
from the Input/Output Bus 309. The Middle Two-port RAM 311 and the Lower 
Four-port RAM 313 perform similar functions. 

10 The Processor architecture of Figure 8, however, has limitations such as- 
Oreqtirmgdato^ 

Upper Adder301 to the input of the Lower Adder303; u)no provisions for 
direcfly inanmulahng complex numbers; W lowing only one multiply-accu- 
J^teperpass through me aic^^ 
15 tote per elementary subroutine; iv) only one AU per stage; v) only one Innut/ 
Output Bus 309; and vQ it is highly ^^f^SLT^rZln^^ 
cesser architecture of the present patent overcomes these limitations. 

The general block diagram of a Two Stage Processor utilizing two multinort 
20 STu ' 317 J nd AUs 319,321,323,325,327,329 is shoTS 

STw * f cJ f ctare p can 6e viewed as two cascaded stages: the Top 
Stage331 and fte Bottom Stage333. In general, AUs 319,321^23 in the Top 
Stage331 get to operands from the Top Multiport RAM 315 and the Arith- 
metic Units 325^27,329 in the Bottom Stage 333 get their operands fromthe 
25 d^ m ^ 0rt RA ^ 31 L? CTe * DOt necessaril y * one-lone co^Z 

25 %Sm£EZ*^-^ M ? tip0rt RAM31S ' 317 «ad ports ^and 
AU 319,321,323^25,327,329 operand inputs. AU outputs can be i) both stored 

S?r2 ' ^tr?,^?*^ f0 paSSed either to the next stage or 
fce ^ut Mulbplexor 335, or both of the previous. There is an Lut 

30 

30 AUs 319,321,325 can also get operands from a Auxiliary Input Buses 341J43 
which ^necessary for special data, such as constant coefficients within elemen- 
taiy subroutines. 

Multiple Two Stage Processors 345,347,349 can operate in a SIMD. configura- 

35 leL Each Two Stage Processor 345,347,349 receives the same rnicrocode 
instruction from the Microstore 351 via the Mcroinstruction Bus 353; hence the 
jingle Instruction" part of SIMD. The operations within a Two Stag! 
^"^7^9 are specified by ^rtca^imJ^ZZ 
operations include: 1) aU the operations of each AU 31932L323 325.327 

40 2) all muMport RAM315,3I7 addresses and «JEK£5£H Si 
bits, such as those for the Output Multiplexor 335. Therefore, one instrocZ 
specifies many operations. The Microstore 351 may include RAM, ROM and 
togic, depending on the implementation. The Microstore 351 must be designed 
to accountforprpe stages in AUs as well as for registers which could beplacS at 

45 the input and output of RAMs. ^ 
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Each Two Stage Processor 345,347,349 has its own Cache RAM 355,357,359 
which supplies its data; hence the "Multiple Data" part of SIMD. Cache 
RAMs 355,357,359 can be multiport RAM, a set of bank-switched single-port 
RAMs, or a mixture of single-port and multiport RAMs. In Figure 10, each 

5 Cache RAM 355,357,359 includes address generation for interfacing with both 
its I/O Bus and its Two Stage Processor 345,347,349; however, if the addresses 
for either of these interfaces was always the same, at least some of the address 
generation logic can be shared between Cache RAMs 355,357,359. In either 
case, the address generation logic used for transferring data between Cache 

10 RAMs 355,357,359 and Two Stage Processors 345,347,349 is controlled by bits 
from the Microstore 351 via the Cache Control Bus 361. Address generation 
logic used for transferring data between Cache RAMs 35535739 and the I/O 
Buses 365,367,369 is controlled by a set of bits from the I/O Controller 371 via 
the Cache Input/Output Control Bus 373. 

15 The set of corresponding Auxiliary Input Buses 375,377 for each Two Stage 
Processor 345,347,349 is driven by either an Auxiliary Data Look-Up Table 
(LOT) 379,381 or directly from the Microstore 351. Figure 10 shows two Auxil- 
iary Data LUTs 379,381, which are controlled by bits from the Microstore 351. 

The address of the Microstore 351 comes from the Microsequencer 383 via the 
20 Microstore Address Bus 385. The Microsequencer 383, in turn, receives its 
instructions from the Microstore 351 via the Microsequencer Instruction 
Bus 387. Both the Micro-sequencer 351 and the I/O Controller 371 receive con- 
trol signals via the External Control Bus 389. 

When particular embodiments of the SIMD architecture and the Two Stage 
25 Processor 345, 347,349 are considered for implementation, the number and type 
of AUs in each stage must be chosen. Also, the appropriate number of Auxiliary 
Input Buses 341,343 must be included. There is a trade off between the number 
of Two Stage Processors 345,347,349 and the complexity of the Two Stage 
Processor 345,347,349. 



30 Detailed Descriptions of Preferred Embodiments 

Many specific Two Stage Processor architectures are possible, hence, a preferred 
embodiment is presented to illustrate possible choices of AUs and multiport 
RAM configurations. The following preferred embodiments of the Two Stage 
Processor have been developed for image processing, with special attention to 
35 computation of motion vectors. 

Architecture with Two Adders and a Multiply-Accumulate 
The overview block diagram of the particular Two Stage Processor architecture 
within the scope of the present patent is shown in Figure 11. The architecture can 
be viewed as two cascaded stages: the Top Stage 391 and the Bottom Stage 393. 
40 The Top Stage 391 consists of the Top Four-Port RAM 395 and The Top 
Adder 397. The Bottom Stage 393 consists of the Bottom Six-Port RAM 399, 
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the Bottom Multiply- Accumulator 401, and the Bottom Adder 403. The Input 
Bus 337 supplies data to the Processor by inputting data into the Top Four-Port 
RAM 395. The Output Bus 339 of the Processor is driven by a Multiplexor 409 
which chooses between the output of the Bottom Multiply-Accumulator 401 and 
5 the output of the Bottom Adder 403. 

The two multiport RAMs 395,399 of Figure 11 simultaneously perform ten 
RAM accesses. More specifically; the ten accesses are: 1) a write access 411 to 
the Top Four-Port RAM 395 from the Input Bus 337; 2) a read access 413 from 
the Top Four-Port RAM 395 for the first operand for the Top Adder 397; 3) a 

10 read access 415 from the Top Four-Port RAM 395 for the second operand for the 
Top Adder 397; 4) a write access 417 to the Top Four-Port RAM 395 from the 
output of the Top Adder 397; 5) a write access 419 to the Bottom Six-Port 
RAM 399 from the output of the Top Adder 397; 6) a read access 421 from the 
Bottom Six-Port RAM 399 for the first operand for the Bottom Adder 403; 7) a 

15 read access 423 from the Bottom Six-Port RAM 399 for the second operand for 
the Bottom Adder 403; 8) a write access 425 to the Bottom Six-Port RAM 399 
from the output of the Bottom Adder 403; 9) a read access 427 from the Bottom 
Six-Port RAM 399 for the second operand for the Bottom Multiply- 
Accumulator 401; and 10) a write access 429 to the Bottom Six-Port RAM 399 

20 from the output of the Bottom Multiply-Accumulator 401. For independent con- 
trol of these ten RAM accesses 411,413,415, 417,419,421,423,425,427,429, ten 
addresses and five write enable signals must be supplied by the Microinstruction 
Bus 353. The Microinstruction 353 Bus also supplies: 0 the type of operations 
for the two adders, 397y403 namely, add, subtract, or pass data; and ii) the type of 

25 operation for the multiplier 401, namely, accumulate or not accumulate; and 
iff) the Output Multiplexor 409 select bit The Microinstruction Bus 353 is 
shown explicitly in Figure 10 and Figure 11, but for clarity, the Microinstruction 
Bus 353 is not shown in the other figures and is assumed to connect to all RAMs, 
AUs, Muxes, etc* in each Two Stage Processor. 

30 Also included in the architecture of Figure 11 is one Auxiliary Input Bus 431 for 
supplying the first Bottom Multiply- Accumulate 401 operand. In practice, an 
additional multiplexor could be added to the architecture to select between the 
Auxiliary Input Bus 431 and the second Bottom Multiply-Accumulate 401 oper- 
and; thus allowing the microcode to either multiply a number read from the Bot- 

35 torn Six-Port RAM 399 by an external coefficient supplied by the Auxiliary Input 
Bus 431 or to square die number read from the Bottom Six-Port RAM 399. The 
Output Bus 339 is driven by an Output Multiplexor 409, which selects between 
the output of the Bottom Multiply-Accumulator 401 and the output of the Bot- 
tom Adder 403. 

40 Inclusion of Complex Numbers 

The square root of -1 is normally shown mathematically by J-i and represented 
by the symbol/. Notation for a single complex number is a + jb 9 where a is the 
real part of the complex number and b is the imaginary part of the complex num- 
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ber. Hence, in computation hardware, two numbers are required to store the 
value of a single complex number. 

The computational ability of the Figure 11 architecture can be increased without 
increasing the number of read ports or write ports on the multiport 

5 RAMs 39539 by storing two numbers in each multiport RAM location and 
doubling the number of AUs 397,401,403. It is important to keep the number of 
multiport RAM read ports and write ports to a minimum because it is important 
to keep the number of bits in the Microinstruction Bus 353 to a minimum. Such 
an architecture is illustrated in Figure 12. Hence, the Top Adder 397 is replaced 

10 by the Top Real Adder 433 and the Top Imaginary Adder 435; the Bottom Multi- 
ply-Accumulate 401 is replaced by the Bottom Real Multiply-Accumulate 
(abbreviated MAC) 437 and the Bottom Imaginary Multiply-Accumulate 439; 
and the Bottom Adder 403 is replaced by the Bottom Real Adder 441 and the 
Bottom Imaginary Adder 443.The restriction of requiring two numbers to be 

15 stored in the same location is not overly restrictive if either i) complex numbers 
are being manipulated; or u) two sets of data are processed in parallel by the 
same Two Stage Processor 345,347,349. In Figure 12, if complex numbers are 
being manipulated, both halves of all complex numbers are processed in parallel 
The architecture of Figure 12 is not a fully complex version of Figure 11 because 

20 the Auxiliary Input Bus 431 is used for real numbers, not complex numbers. The 
architecture of Figure 11 could be made fully complex by changing all data 
buses, including the Auxiliary Input Bus 431, to complex buses and by changing 
all AUs 397,401,403 to handle complex numbers on all their inputs. 

In the figures, when a data bus splits into two buses (for example, one 32-bit bus 
25 splitting into two 16-bit buses), a "demerge" symbol 453 is used; and when two 
buses become unified into one bus, a "merge" symbol 455 is used In Figure 12, 
both the Input Bus 337 and the Output Bus 339 are used to transfer complex 
numbers. 

Inclusion of Multiply by -A in Arithmetic Units 
30 The process of multiplying a complex number by J^i r or can be done by swap- 
ping the real part of the complex number with the imaginary part and reversing 
the arithmetic sign of the new real part. Mathematically, this is 
j(a+jb) = -fr+ja. 

The architecture of Figure 13 adds the capability of multiplying any AU operand 
35 by j by: i) adding Crossbar Multiplexors 457,459,461,463,465 after the read 
ports of the multiport RAMs 39539, thereby optionally swapping the real and 
imaginary parts of a complex number; and ii) modifying the control bits supplied 
by the Microinstruction Bus 353 to the adders 433,435,441,443 and 
multipliers 437,439 to account for the arithmetic sign change. In Figure 13, an 
40 AU may be considered to include both a Crossbar Multiplexor 457,459, 
461,463,465 and either an adder 433,435,441,443 or multiplier 437,439. In prac- 
tice, since addition and multiplication are commutative (that is, a + Z> = Z> + a 
and ab = ba) 9 only one Crossbar Mux could be required per pair of AUs (pairs 
are: 433 and 435; 437 and 439; 441 and 443) because many elementary subrou- 
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tines can be written to requite only one of the two operands to an adder or multi- 
plier to be multiplied by j. 

Redaction in the Number of Multipliers 

In Figure 14, a Two Stage Processor architecture is shown with only one Multi- 
5 ply-Accumulate 476, which is otherwise identical to that of Figure 13. Since 
MAC functions are relatively expensive to implement, it is useful to reduce the 
number of MACs. Because of this fact, elementary subroutines are usually 
designed to minimize the number of required multiplies. A ratio of 4-to-l 
between adds and multiply-accumulates can be achieved for many elementary 
10 subroutines of interest In the architecture of Figure 14, the Bottom Six-Port 
RAM 399 has: i)two read ports 469,471 for complex numbers; if) two write 
ports 473,475 for complex numbers; w) one read port 477 for real numbers; and 
iv) one write port 479 for real numbers. For items Hi and zv, the Bottom Six-Port 
RAM 399 must be able to separately read and write the real and imaginary parts 
15 of stored complex numbers. Hence, the Bottom Six-Port RAM 399 in Figure 14 
must has four write enables rather than three. 

Since the Bottom Six-Port RAM 399 of Figure 14 might be difficult to build, an 
alternative is presented in Figure 15. The Bottom Six-Port RAM 399 is imple- 
mented with a Multiplexor 481 and two six-port RAMs 483,485: the Bottom 
20 Six-Port RAM, Real Part 483; and the Bottom Six-Port RAM, Imaginary 
Part 485. Four write enable signals must still be supplied, but two are shared by 
both RAMs 483,485. 

Inclusion of Scale Factors in the AUs 

In many elementary subroutines, arithmetic scale factors which are integer pow- 
25 ers of 2 are included. For example, an operation such as z = x+2y or 
z = x+y/2 can be done with an adder if it includes a scale factor capability. 
Another reason to include such scale factors is to account for different decimal 
positions in multiplier coefficients supplied by the Auxiliary Input Bus 431. 

In Figure 16, Shifters 487,489,491,493,495,497,499,501,503 have been added 
30 before the input of each AU 433,435,441,443,467. These Shifters 487/189,491, 
493,495,497,499, 501^03 can be designed to do right arithmetic shifts and/or 
left arithmetic shifts. If the AUs 433,435,441,443,467 are using floating point 
representation, the Shifter 487,489, 491,493,495,497,499,501,503 becomes an 
a dde r which is only applied to the exponent In practice, since addition and mul- 
35 tiplication are commutative, only one Shifter could be required per pair of AUs 
(pairs are: 433 and 435; 437 and 439; 441 and 443) because many elementary 
subroutines can be written to require only one of the two operands to an adder or 
multiplier to be scaled. Additionally; a shifter could be placed after the Output 
Multiplexor 409 in order scale the output as needed. 



WO 93/23816 



PCT/US93/04658 



Dual Multipliers, Multiplication by and Scale Factors 

If the features of multiplying by J^i 9 or j, and scale factors are added directly to 
the architecture of Figure 13, the architecture of Figure 13 results. This is also 
equivalent to the architecture of Figure 16 where two multiply- 
5 accumulators 437,439 are included Because there are two MACs 437,439, there 
must also be two Shifters 505,507. 

Computation of Transforms 

Hie above architectures are intended to perform the cross correlation computa- 
tion on blocks of data. A key portion of this computation is done with DFTs, 
10 which can, in turn, be efficiently computed with Short Length Transforms 
(SLTs). To provide an example of an SLT for further discussion, the set of opera- 
tions for a 16-point transform are labeled Equation 8 through Equation 13. Each 
of these "equations" is actually a set of simple equations which could be per- 
formed by the same AU within a Two Stage Processor. 

15 The input data vector for this 16-point SLT is complex, x k = x kr +jx ki9 
k = 0, 15, where is the real part denoted with the subscript V and x^is 
the imaginary part denoted with the subscript "f \ If the architecture of Figure 13 
is utilized, then all the operations within Equation 8 would be performed in the 
Top Real Adder 433. 

(EQ8) 



hr = x 0r + x $r 
f 5r~ X 6r + X 14r 
f 9r = X 3r + *llr 
'l3r = x 7r + x 15r 
hlr = / 15r + *16r 
*21r = '9r~'l3r 
f 75r = 'l0r + / 12r 
*29r = *15r~'l6r 

*33r = *4r~'6r 
hlT = X Mr~ X Ar 



*2r ZZX 4r + X 12r 
*6r ~ X 6r~ X \4r 
*10r = x 3r~ x llr 
f 14r ~ x lr ~ x l5r 

*22r = *18r + *20r 
*26r = hlr^hor 
f 30r = *lr~*2r 
*34r = f 24r +t 26r 
*38r = f 19r + *21r 



*3r-*2r + *10r 
hr = X lr + X 9r 
hlr = x 5r + x 13r 
hSr = / lr+*2r 
f 19r = t lr~hlr 
r 23r = hr + f 14r 
*27r = hlr + t llr 
hlr= x Qr~ x &r 
HSr ~ hor " h%r 
h9r = hr + hr 



Ur = x 2r~ x 10r 
*8r = X lr~ X 9r 
t \2r = X 5r~ X 13r 
h6r = hr + hr 
twr^hr + hSr 
r 24r = / Sr~ f 14r 
f 28r = hlr " *22r 
f 32r = f 19r~'21r 
*36r = '5r~'3r 
f 40r = *23r + *25r 
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'if = -% +x 8j 
hi~ x 6i^ x Ui 
f 9i = J 3i +X Ui 



'l3f = 


x 7i + *15i 


*17/ = 


*15i + *16* 


*21f = 


:f 9*~*13l 


r 25i = 


hQi + hli 


*29i = 


*15/~*16i 


f 33i = 




hli = 


X 12i~ X 4/ 



r 2i = - x 4i +x 12i 
f 6i = *6i~ X 14f 
*10r SI 3r JC U/ 
*14f ~ x Ti~ x lSi 

hzi^hi+hu 
tm^hn+twi 

f26f."'l2i"*10f 
*30i "*1j~*2i 
*34i =/ 24i + ^26i 
f 38£ = f 19r +f 21i 



f 3i =JC 2i +x 10/ 
*7i " x li* +x 9i 
f lli =x 5i +x 13i 
*15i = 'li +/ 2i 

*23i = r 8i + t 14i 



*27i = 


hli +t 22i 




' x Oi" x 8/ 




*20/~*18i 


*39i -U&Ui 



(EQ9) 



*4i = *2i~ x 10» 
f 8f =X lf" X 9i 
*12i = x 5i~ x 13i 
*16i = hi + hi 
*20r = / 9f + / 13/ 
*24f ~*8/~'l4/ 
*28i = *17/~*22i 
h2i = h9i~h\i 

hsi~hi^hi 

^40/ = f 23i + / 25/ 



Analogously, all the operations within Equation 9 would be performed in the Top 
Imaginary Adder 435. In both Equation 8 and Equation 9 the set of the first inter- 
mediate results, labeled through and t u through t m * are stored in pairs 
5 as complex numbers in either fee Top Four-Port RAM 395 or the Bottom 
Six-Port RAM 359. These results are only stored into the Top Four-Port 
RAM 395 if they are used on the right side of equations in either Equation 8 or 
Equation 9. For example, is generated by the equation = x Ar + but is 
used in the equation = hr +t 2r m & hor ~ *ir~*2r to g eneiat & t^ r and 

10 

All the operations within Equation 10 would be performed in the Bottom Real 
Multiply-Accumulate 437, while all the operations within Equation 11 would be 
performed in the Bottom Imaginary Multiply-Accumulate 439. 

nor^hir 00 * ( 2e > 

m 2r = *34r cos ( 38 ) 
m 4r = *26r ( C <* (38) - COS (9) ) 

m 6r = f 39/ sin (28) 
m *r = ^23f "(«tt(3B) - sin (9) ) 

ro 0i = f 32i COS ( 2e ) 

m 2l = f^cos (39) 
15 m Ai = t m icos (39) - cos (9) ) 
m Si = -t 39r siii(2Q) 



m lr=h3r C0S C 20 ) 

m 3r=^24r( C0S ( e ) +C0S ( 3e )) 
. /w 5r = f 38f sin(29) (EQ10) 

m 9r - t w (sin (8) + sin (39) ) 

m 1/ = f 33| .cos(29) 

m 3i = *2U C 008 + cos C 30 ) ) 

m 5/ = -f 38r sin(29) (EQll) 

m 7l - = -f^sin (39) 

m 9i = " r 25r (8) + ^ ( 3 8) ) 
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In Equation 10 and Equation 11, 6 = 2%/16. Constants, such as 
cos (30) - cos (6) , are supplied by the Auxiliary Input Bus 431. In some of the 
operations listed within Equation 10 and Equation 11, such as 
m 5r = f 38,* sin (26) , an imaginary part of a number, in this case / 38/ , is used to 

5 generate the teal part of another number, in this case m 5r . The same situation 
occurs for the respective imaginary and real parts, namely / 38r and m 5i . When 
this type of pair of operations are required, the Crossbar Multiplexor 461 swaps 
the real and imaginary parts of the number read from the Bottom Six-Port 
RAM 399 before it reaches the multipliers 437,439. This, in effect, coupled with 

10 appropriate arithmetic sign changes, multiplies numbers by j. 

All the operations within Equation 12 would be performed in the Bottom Real 
Adder 341, while all the operations within Equation 13 would be performed in 
the Bottom Imaginary Adder 343 

(EQ12) 



lr = '30r + m 0r 


5 2r = / 30r" m 0r 


5 3r = hirhsi 


s 4r ~ "'38r"^3i 




S 6r-hlr~ m lr 


S lr = m 3r~ m 2r 


s Sr = m 4r- 


m : 


: 9r = 5 5r + J 7r 


s lQr = s 5r~ s lr 


S llr~ s 6r^hr 


S 12r = S 6r " 




3r = m 6r~*37i 


s l4r = " m 6r " *37i 


S l5r = m lr + m *r 


s l6r = m 7r" 


m 


7r = 5 13r + 5 15r 


5 18r = s 13r~ s l5r 


S 19r = s 14r + S l6r 


S 20r = s 14r ~ 


s i 


X 0r = m 0r 


X lr = J 9r + ,y 17r 


X 2r = S lr + s 3r 


X 3r = S 12r " 


s 2 


4r = / 29r" / 35i 


X 5r = s llr + S 19r 


X 6r = s 2r + S 4r 


X lr = s \Or ~~ 


s i 


X %r ^ *28r 


*9r- 5 10r + *18r 


X l0r~ s 2r~ s 4r 


X llr- S llr~ s ] 


Mr = f 29r + r 35i 


X 13r = ,y 12r + *20r 


X Ur = s lr~ s 3r 


2 15r" J 9r" 


s l 
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(EQ13) 



^1/ Hoi^^i 


Si; — ^30/ ~~ HI f)t 


JOr 351 


Sa : ~ far. ' 




s 5i-hli + m U 




J 7i = 7W 3i~ m 2/ 


% = m 4i 


" m 2i 


s 9i~ s 5i + s 7i 




= ,s 6i + j 8i 




"% 


; 13i = m 6i + f 3lr 


J 14i = Hlr~ m 6i 


*15j = m 7/ + ,w 8i 






: l7i~ s 13i + s lSi 


s lSi = s 15i~~ s 15i 


^19i =J 14£ + ^16i 


s 2Di ~ s Ui 


"*16 


X 0i = m Qi 


*li = s 9i + s 17i 


X 2f = J lf + J 3i 


x 3i = s m 










X 7i = s 10i 




^8/ = f 28i 


*9i~ ff I0i + J 18i- 


X 10i = J 2i" J 4i 




-*is 


f 12i = *29i~*35r 


^13/ = ir 12i + 5 20i 


^14i = 5 1/"^3j 


*15i =J 9i' 





The output data vector for this 16-point SLT is complex, X k = X^+jXtf, 
k = 0, 15. The Crossbar Multiplexors 463,465 are used to swap inputs for 
operations such as: s Sr = t m -t m and s 3i = ^ r +^ 38f . In none of the equa- 
5 tions is there a situation where two imaginary parts of numbers are added to 
make a real part of a number; thus, one of the Crossbar Multiplexors 463,4(5 
could be removed from the architecture. However, removing a Crossbar 
Multiplexors 463,465 must be done with care, as some other elementary subrou- 
tine may need both Crossbar Multiplexors 463,465. 

10 In all the 16-point SLT operations, there are: 80 adds which take place in the Top 
Real Adder 433 and the Top Imaginary Adder 435; 20 multiplies which take 
place in the Bottom Real Multiply-Accumulate 437 and the Bottom Imaginary 
Multiply-Accumulate 439; and 72 adds which take place in the Bottom Real 
Adder 441 and the Bottom Imaginary Adder 443. If this 16-point SLT were per- 

15 formed many consecutive times, the multipliers 437,439 would be underutilized* 
This is why the architecture of Figure 14 is attractive; it eliminates an expensive 
component a multiplier: 

If the input data vector contains zeros the number of operations can be farther 
reduced. For example, if x k *0 for k = 0,...,7 butx fc = OforJfc = 8, ...,15, 

20 then a special 16-point zero-padded SLT can be developed. In this case, all the 
operations in Equation 8 and Equation 9 contributed by x k = 0 for 
k = 8, 15 can either be eliminated or amplified. This innovation is applied 
to Steps 6 131 and 7 133 in Figure 6 and Steps 6 227 and 8 235 in Figure 7 but 
with a 32-point SLT. A similar simplification can be carried through Equation 12 

25 to Equation 13 r if a part of output vector is not needed or discarded, they don't 
have to be computed. For example, if X k for = 8, 15 are discarded, then 
operations that compute these outputs can be eliminated from Equation 12 and 
Equation 13. This innovation is applied to Step 13 159 in Figure 6 and 
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Step 13 255 in Figure 7 but with a 32-point SLT. These special SLTs amount to 
about 25% saving in computation for some transforms. 

The operations for the above 16-point SLT example must be scheduled on an 
architecture when the architecture is implemented. That is, each operation must 

5 be scheduled to occur at a specific time in the hardware, and, the storage location 
of results of all operations must also be scheduled. In general, a schedule for an 
architecture and a particular set of equations can be generated as follows: 1) sort 
the equations according to which AU will perform them; 2) for each AU, sort its 
equations in precedence order, that is, equations which use results from other 

10 equations are put at the end of such a list; 3) generate a two-dimensional matrix 
with one row per equation for all equations and one column per clock period, that 
is, equation results on the vertical axis and time on the horizontal axis; 4) starting 
with the first equation in the precedence list for the first AU, put an "X" in the 
equation versus time matrix where the AU finishes the computation for that 

15 equation, and proceed to the rest of the equations for all AUs (there should be 
only one "X" per row of the matrix and only one "X" per column for a particular 
AU); 5) as an "X" is placed, care must be taken to make sure the inputs to the 
equation under consideration are available from a multiport memory 115417, it 
is necessary to place an "X 7 ' further to the right if its equation's inputs are not 

20 available until that clock cycle, this will leave vacant clock cycles for some AUs 
which will be filled by subsequent equations; 6) proceed until all equations have 
been scheduled, that is, each row of the matrix has an "X"; 7) if there are unoccu- 
pied clock cycles (column in the matrix) for an AU (which means a resource is 
not being fully used or partially wasted), "Xs" can be swapped in order to reduce 

25 the number of columns; 8) place a "Y" in each matrix location starting to the 
right of each "X" and continue placing "Ys" until the last clock cycle for which 
the result of that equation is needed by any other equation, hence, each row 
should have one "X" followed by a contiguous set of "Ys"; 9) place "Ys" for all 
"Xs n , the set of each "X" and "Ys" in a row show how long the result of an equa- 

30 tion needs to be stored in a multiport RAM 115,117; 10) sets of "X" followed by 
"Ys" are now grouped together provided they do not share any columns and the 
value is stored in the same multiport RAM 115,117, these groups represent equa- 
tion results which can be stored in the same location in the same multiport 
RAM 115417; 11) the number of groups, for a particular multiport 

35 RAM 115,117, are the number of register locations required for that multiport 
RAM 115,117, these register locations are now assigned; and 12) the schedule is 
complete and is translated into the "bits which need to loaded in the 
Microstore 351. This procedure for generating microcode can be mechanized by 
writing a computer program to do it This would allow sophisticated techniques 

40 to be used, such as the "simulated annealing" algorithm to be used in step 7 
described above. 

Essentially the same scheduling algorithm can be used for looping on sets of 
equations. However, in this case, the equation versus time matrix must be viewed 
as periodic, and placing an **X" is actually placing an entire series of "Xs", one 
45 for each period of the loop. 
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Another class of elementary subroutine of interest to real-time video compres- 
sion is the DCT. DCTs can be described by a set of equations similar to that of 
the above 16-point SLT example. However, many DCT equation sets include 
operations such as u 0 = h 0 + 2/z l9 which justifies die addition of 
5 Shite 487y489,49M93,495^497^99^01, 505,507 to the architecture. 



Computation of Cross Correlation — Method A 100 

In Figure 6, a set of procedures for Method A 100 are presented to show the 
sequence of computation on input data blocks. Without loss of generality, we 
denote a data block whose motion vector needs to be estimated to have a size of 
10 xN 2 ^ a search block a size of M x xM 2 - Also, we restrict that Af x > ^ 
and M 2 > N 2 . For simplicity of explanation, M v M 2 , N l9 andW 2 are assumed to 
be even, but the method can be easily modified for non-even data block sizes. 
Accordingly, the cross-correlation in Equation 14 has the general form: 

c(n,m) = £ X*0W)y(i+n,/+m) 1 1 (EQ14) 

ito ito m = 0,...,M 2 -iV 2 

15 where x(*,0 with j = 0, 1 and / = 0, ...,N 2 -1 is the data block, 

y (i, /) with i — 0, ..^Mj-l and / = 0, Af 2 -1 is the search block, and 
c(/z, m) with n=0, •..,Af 1 — N x and m =0, M 2 ~ N 2 is the cross correla- 
tion. If the input blocks have zero mean and unity variance, then Equation 14 
becomes the CCF Equation 2. Since each input block is DFTed first, if the aver- 

20 age or the mean of the block is not zero, it can be easily removed by simply forc- 
ing the DC component of die DFT to zero. This allows to compute a close 
approximation, called CCV2, to CCV in Equation 4, 

The difference is that y(i+/i,/+m) in Equation 4 is a moving average, 
25 whereas in Equation IS it is a globe average. Because the bulk of operation in the 
method involves forward and inverse DFTs, a pair of data blocks and a pair of 
search blocks can be formed into a data block and a search block of complex data 
structure. Consequently, this method computes two motion vectors simulta- 
neously. In the following description, y (i + n, / + m) in Equation 14 is replaced 
30 with y (i+n-Z^/2, /+m-Z> 2 /2) , where n = — Dj/2, ... t D x /2 and 
m = -I> 2 /2,... ,D 2 /2. 

Step 1101. Referring to Figure 6, the input data blocks y x (i 9 l) 103 and 
y 2 (f, /) 105 are first combined into a complex blocks w (i, [) 107 through the 
35 following relationship, 

i = 0, ...jMi-l 
/ = 0, ...,M 2 — 1 
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This arrangement involves no mathematical operation, just a conceptual rear- 
ranging of the data. 

Step 2 109. A DFT is performed along the rows (index I) of w (/, /) 107 is per- 
formed to yield W (i, s) 111, 

5 W(i, i) = £ *» n J , (EQ17) 

/T 0 5 = 0, ...,M 2 - 1 

The operations in Equation 17, and other DFT steps described here, can be done 
by one of several DFT algorithms, but the emphasis here is the SLT. 

Step 3 113. This follows by performing DFT along the column (index j) to obtain 
W(k,s) 117, 

(EQ18) 

10 „, v t 1 - -*t? * = 0 M, - 

'(*,*) = = W r (k,s)+jW i (k,s) ' \ 1 

1 = 0 s = 0,...,M 2 - 

Step4 115. This step deals with the separation of W(k, s) 117, which has com- 
plex data format, into Y x {k, s) 119 and Y 2 (k, s) 121, which are the DFTs of 
Vj (i, /) 103 and y 2 (i, 0 105, respectively. This can be accomplished by using 
the following formulas 

(EQ19) 



15 



T lr (k, s) = i [W r (k,s) + W r {M 1 -k t M 2 -s)] 
Y u (k, s) = i [W, (k, s) - W { (Mj - k, M 2 - s) ] 
Y^Vcs) =^[WiVcs) +W i (M 1 -k,M 2 -s)] 
Y 2i (k, s) = -\ [W r (k, s) - W r {M l - k K M 2 - s) ] 



Jt = 0, — 1 

My 

5 = 0 T 



where 



Y lr (k,s) +jY u (k,s) ^Yidcs) 
Y 2r (k,s) +jY 2i {k,s) =Y 2 (k,s) 



k = 0,...,M l -l 
5 = 0, ...,M 2 -\ 



(EQ20) 
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Note that only half of 7 1 (k, s) 119 and Y 2 (k, s) 121 are computed and the 
other half are obtained by using die conjugate symmetry property of DFT of real 
data, 



Y lr (k,s) ^Y^iM^KMi-s) 
Y Xi {k,s) ^-YuWi-lcMi-s) 

Y 2i (k,s) =-7 2/ (M 1 -t,M 2 -.y) 



k = 0,...,M l -l 

s = -^- + 1, ...,M 2 — 1 



(EQ21) 



5 Step 5 123. The data blocks x x (i, 0 125 and x 2 (i, /) 127, both of the size of 
N 1 xN 2 , are first combined into z(i, /) 129 that has a size of M 1 xM 2 , 

(EQ22) 



,0 = 



Di £>2 ^1 ^2 



I -r^ + iVj — 



^ 2"> "2" "^^2" 

otherwise 



where = M 1 — ^ and I> 2 = M 2 ~N 2 * Again, this arrangement involves no 
operation. Equation 21 indicates the input data blocks 125427 occupy only the 
10 center portion of 2 (z,/) 129, and the rest of z(z*,7) 129 are filled with zeros. 

Step 6 13L Similar to the DFT procedures in Step! 109, DFT of the rows of 
z(f,Z) 129 (index/) yields Z(i,s} 135 

(EQ23) 



0 



otherwise 
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Note that only N x rows need to be DFTed because the other D x = M l -N 1 rows 
contain only zeros, as indicated by the index z. Further reduction in operation is 
possible since in each of N x M 2 -point DFT M 2 -N 2 input elements are zero, as 
indicated by the index 1 and a special SLT with fewer than normal inputs, as 
5 described above, can be used. 

Step 7 133. Applying DFT to the columns (index i) Z(i 9 s) 135 yields 
Z(i,s) 139. 



Ks) = X Z(t = Z r (k J s)+jZ i (k,s) 



(EQ24) 



The number of operations needed to compute the DFTs in Equation 24 can be 
10 reduced because M l -N 1 elements in each column are zero, as indicated by the 
index i. Hence, a special SLT with fewer than normal inputs, as described above, 
can again be used here. 

Step 8 137, The separation of Z (Jfc, s) 139 into X x ( Jfc, s) 141 and X 2 (k, s) 143 
is analogous to the process in Step 4 115, that is 



15 



X lr (k, s) = i [Z r (*. s) +Z r {M x -k,M 2 - 


■s)] 








X li Vcs)=±[Z i (k,s) ^Z,(M l -k t M 2 - 


s)] 




*=0, . 


.., A/j 


X 2r a,i)=5[Z i (i,s)+Z I .(M 1 -i l Af 2 - 


s)] 




5-0,. 


M 2 
'"~2 


XvVc s) = -i [Z r (*, s) -Z, (M x -k,M 2 


-s)1 








where 










X lr (k,s) +JX u (k,s) = X t (k,s) 1 


jfc: 


= o, 


A/j — 


1 


X^ (*, s) +JX 2i (jfc, s) = X 2 (k, s) J 


S : 


= o, 


• • >y h^ 2 "~ 


1 



(EQ26) 



And similarly the other half are obtained from conjugate symmetry property 
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X lr (k,s) =X lr (M 1 -k,M 2 -s) 
Xu&s) =-X li (M l -k,M 2 -s) 
X^iKs) =Z 2r (M 1 -ife,M 2 -j) 
X 2i (k,s) =-Z 2/ (M 1 -A,M 2 -j) 



k = ti, ...,16^1 

s = + 1, ...,M 2 — 1 



(EQ27) 



ftqw i-4 102,109,113415 and Steps 5-8 123,131433,137 are similar and can be 
performed in parallel if the necessary resources are available. 

Step 9 145. In this step the frequency data blocks 119,141 are multiplied to 
5 obtain U x {k,s) 149, 

(EQ28) 

U x (k, s) = X 1 (k, s) T*(k, s) Jc = 0, M x - 1 s = 0, 

Since Zj (k,s) 141 and Fj 119 have conjugate symmetry property, so 
does Uyiks) 149 and, as the result, only half of U x (k, s) 149 is computed. 

Step 10 147. In mis step U 2 (k, s) 151 is computed 

(EQ29) 

10 * Mo 

U 2 {k, s) = Z 2 (*, F 2 *(*. s) k = 0, My - 1 * = 0, .... 

Again, Step 9 145 and Sfiqj 10 147 are independent processes and can be per- 
formed in parallel. 

Step 11 153. Like forward DFT, the inverse DFT also operates on complex data. 
Therefore, to fully utilize M 1 y.M 2 inverse DFT, I^C*,*) 149 and 
15 U 2 {k,s) 151 must first be combined to form P (jfc, s) 155 as the following 



P(k,s) =P r {k,s)+jPi(k,s) 



*=0,.„,M 1 -1 
5 = 0, Af 2 - 1 



(EQ30) 



and 



P r ^k,s)=U lr {k,s)-U li (k,s) 
P i (k,s)=U 2r (k,s)+U l! (k,s) J 



Jfc=0, ...jMy-l 

s = Q,...,M 2 -\ 



ffQ31) 



in Equation 30, when the other half of U x (k, s) 149 and U 2 (k, s) 151 are used, 
20 the following conjugate syinmetry formula can be applied 



WO 93/23816 



PCT/US93/04658 



27- 



U lr (k,s) =U lr (M 1 -k,M 2 -s) 
U Vl (k,s) =-U li (M 1 -k,M 2 -s) 
Utiles) ^VtoWi-KMz-s) 
U 2i (k,s) = -U 2i (M 1 -k,M 2 -s) 



k = 0,...,M 1 -l 



.,M 2 -1 



(EQ32) 



Step 12 57. Once P(k,s) 155 isobtained, inverse DFTs can be performed along 
the columns (index it) to obtain P(i,s) 161 



Af,-1 



j— 1 = 0,....^,-! 



t = o. 



J = 0, ...,M 2 - 1 



(EQ33) 



Step i3 159. This follows by inverse DFT on rows (index s) to obtain 
P(UD 165 



2x1* 
J-7T- 



_ — i = 0, .-.-^-.A/j-— ...,A/j- 1 

=M~ 2 S P(, »* )e *» "PrttO+iPittO . * ^ (EQ34) 
* =0 / = 0,... ,— ,Af 2 — — ,...,Jf 2 _ l 

where £>j = M 1 -N l and D 2 = M 2 -N 2 . Note that, since only part of 
p (i, /) 165 is needed, the inverse DFT is only performed on D l + 1 rows, as 
10 indicated by the index i. Furthermore, since only D 2 + 1 columns are needed in 
p (i, I) , as indicated by the index /, a special SLT can again be used to save oper- 
ations. 

Step 14 163. In this final step, p (i, /) 165 must separated into two cross correla- 
tion blocks. Because of the way the merge is performed in Stepll 153, no com- 
15 putation is needed for Step 14 163 and the two cross correlation blocks 
c x («, m) 167 and Cj (#», m) 169 are given by 

(EQ35) 



n,m) = 



p r (n,m) n = 0,...,D l /2 m = 0, ...,D 2 /2 

p r {M l + n,M 2 +m) n = -1, ...,-£>j/2 m = -1, ...,-D 2 /2 

p r (n,M 2 +m) n s D,...,Z),/2 m = -1, ...,-D 2 /2 

^(Mj + n,^) n = -\,...,-D x /2 m = 0,...,D 2 /2 
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(EQ36) 



n,m) = 



Pi(n,m) n = 0 9 ...,D 1 /2 m = 0, ...,D 2 /2 

p i (M 1 + n 9 M 2 + m) n = -1, ...,-Z)j/2 /n = -1, ...,-<D 2 /2 

p i (n 9 M 2 ^m) n = O,...,!)^ m = -1, ...,-D 2 /2 

^(M^^m) n = -1, ....-i^/l /n = 0, ...,D 2 /2 



The indices n and ro where c l (n y m) 167 and c x (n, m) 169 are maximum are 
the motion vector estimates for the data blocks jc x (/,/) 125 and ^(^Z) 127. 

Computation of Cross Correlation - Method B 101 

5 In Figure 7, a set of procedures are presented to show Method B 101. In this 
method, only two real input blocks are used and the output is one cross correla- 
tion block corresponding to Equation 14. Similar to Method A 100, the block 
size is kept arbitrary: 

Step 1 207. Referring to Figure 7, the search block y (i, I) 201 of the size of 
10 M x xM 2 is first combined into a complex block $ (z, I) 209 of the size of 
My/2 x Af 2 by the following formula, 

(EQ37) 

JftO =y0\0+/y(^+U) i = o,.„,^i-i / = o,...,m 2 -i 

This step requires no operation. 

Sfq?22U.ThenDFTalongtherows6ndex/)of j>(i,/) 209 is performed to 
15 yield ?(z,$) 215 

(EQ38) 

t(i,s) = E 9(UDe J ^ = f r (i,s)+J?i{i 9 s) T" 1 
/=0 * J = 0,...,M 2 -1 

Step 3 213. This step deals with the split or separation of f (/, 5) 215 into a com- 
plex block f(i 9 s) 217 to be column DFTed, 
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Y r (i,s) = ±[t r (i,s)+f r (i,M 2 -s)] 
fi(i,s) = \ l?i(i,s) -?i(i,M 2 -s)] 
Y r d +-p s) = ± [7, (/, j) + 7; (i, M 2 - s) ] 



1 r * 



7, (/ + -£,s) = -j [Y r (i, s) - Y r (/, M 2 - s) ] 



1=0, 



M 2 

s — 0, 



(EQ39) 



where 

7, (i, s) +jYj (i, s) = 7(i, j) l = 0, M x - 1 j = 0 M 2 - 1(EQ40) 

Dae to the conjugate symmetry, the other half of Equation 38 need not be com- 
5 puted and are simply given by 



Y r (i,s) = Y r (i,M 2 -s) 
Yi(i,s) --7,atf 2 -*) 



1 = 0, ....Afj-l 
s = ~y + 1, ...,M 2 -1 



(EQ41) 



Step 4219. In this step DFTs are performed along the columns (index 0 of 
Y(i,s) 217 to yield Y (i, s) 221 



(EQ42) 



2mt 



k = 0 Mj-1 

7(M) = £ y(U)e"-'"«r = 7 r (*,5) +/7,.(*,*) M 
i=0 s = 0 — 

10 which is the frequency domain representation of the data block y (i, /) 201. 
Again, the other half of the block 221 have conjugate symmetry and are given by 



(EQ43) 



7(*,M 2 -5) = 7*^-*,*) * = 0 M x -l 5 = 1 -J-l 

which are not computed. 

Sfcp 5 223. The data block x (i, /) 203 of the size of N x x AT 2 is first combined 
15 into a complex data block St ( i, I) 225 of the size of M x /1 x M 2 
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CK! 44) 



0 = 



xO- T ,/- T ) + ;,0- T + T ,/- T ) 

2 * 2 2. 
0 otherwise 



where D x = M l -N 1 and/> 2 = M 2 -N 2 . 

Steptf227.Thisisfonowedbyi^ 225, 

(EQ45) 



) = X HWe 3 ** =t(i,s)+jXiU,s) 2""* 2* + T : 
/=i V 2 j = 0 M 2 -l 



5 Note that since D 2 — M 2 ~N 2 elements are zero for each row DFT, as indicated 
by index /, a special SLT can be used, like before to save compulations. 

Step 7229. This step is similar to Step 3 213 and X (i, s) 231 is split into a com- 
plex block X (i, s) 233 to be column DFTed, 



X,- (f, s) = | & (i, j) (i, M 2 - s) ] 
2 '" ; = 2' 



Mi 1 ^ . 
X r (/+ s) = ^ [Z f (/, j) +± f (i, M 2 - j) ] 



10 where 



i = 0,...,^i-l 

(EQ46) 

4 = 0, 



(EQ47) 

f = 0.....M!-! $ = 0,...,M 2 -1 
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Similarly, due to conjugate symmetry, the other half of the block are simply 
given by 



X r (i,s) =X r (/,M 2 -5) 
Xi(i,s) B-Xg&Mi-s) 



i = 0, ...,M x -\ 

M 2 (EQ48) 
s = + 1, A/ 2 — 1 



Step 8235. In this step, DFTs along the columns (index i) of X(i, s) 233 are 
5 computed, 

(EQ49) 

0/2+^,-1 ,2iti* k = 0, M Y - 

1 = 0/2 5 = 0, — 

Similarly, = M x -N x elements are zero for each of the column DFTs, as 
indicated by index i, a special SLT can be used to reduce the number of computa- 
tions. Conjugate symmetry property gives the other half of the block 

,o «*» 

M 

X(k,M 2 -s) = X*(M 1 -k,s) k = 0,...,M 1 -l j=l,... 2-1 

Step 9239. The frequency domain multiplication is performed in this step to 
obtain U(k,s) 241, 

M 

U(k,s) =X(k,s)Y*(k,s) k = 0,... t M l -l s = 0,...,-=2(EQ51) 

15 

like X (k, 5) 237 and Y (*, s) 221, U {k, s) 241 has conjugate symmetry and 
the other half are not computed. 

Step 10 243. IDFTs are performed on the columns (index it) of U (Jfc, s ) 241 in 
this step, 

(EQ52) 

M, - 1 .2*1* i = 0, Mj - 

20 ^ = ^ ^U(Ks)e^ = U r {i,s)+jUdi,s) ^ m 2 

Srep 11249. This step involves splitting the block U(k,s) 245 into another 
block (J (k,s) 247 before the row IDFTs are performed. The split is done 
according to the following formula, 
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where 



M 



i = 0,...,— — 1 5=0,...,— 



(EQ53) 



r r (i,Jtf 2 -s) +or i (i+^,Mj- J ) i = a ....-^-1 * = M 2 -l 



» = 0, — — 1 5=0 — 



(EQ54) 



-l/ / (i,Af 1 -i)+l7 r (i+^,Jlf 2 -i) «=0,...,-j--l f = -^ + 1 M 2 -l 



0", J) = U r (i, 5) (I, s) i = 0, -± - 1 j = 0, M 2 - |EQ55) 

Step 12 25L IDFTs now are performed on the rows of U(i, s) 247 (index s) to 
obtain the spatial domain block ft (/, /) 253 

(EQ56) 



Mj-l _2xir 
2-t =0 



f = 0....,^-l 



10 



Note that only Z>j/2 rows are IDFTed since die resultant cross correlation has 
smaller d imen sions which are indicated in the next step. Furthermore, since only 
D 2 + 1 columns are needed in U (i, /) , as indicated by the index /, a special SLT 
can be used to save computations. 

Step 13 255. Finally hi mis step, die cross correlation c(n,m) 205 is obtained, 



(EQ57) 



l,I7l) = 



U r (n,m) 
M 



n = 0, ...,D X /1 tn — 0, ...,D 2 /2 



fi,-("2-+ii,M 2 +m) n = -1, ...,-D x /l m = -1, ...,-D 2 /2 
a r (n,M 2 +jn) « = 0, ....Z)^ m = -1, ...,-D 2 /2 

(-y + n, m) « = -1, -D x /1 m = 0, .... 0 2 /2 



15 



This adjustment requires no operations. The indices n and m at which 
c(n,m) 205 is maximum is the motion vector estimate for the data block 
x(i,Z) 203. 
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WHAT IS CIAIMED IS; 

1. A method of comparing a specified image with a 
multiplicity of other images, the steps of the method 
comprising: 

(a) selecting a first one of said multiplicity of 

5 other images; 

(b) cross-correlating said specified image with 
said selected one of said multiplicity of other images by 
selecting two blocks of real data representing said specified 
image and combining them into one block of complex data by 

10 making the first block of said blocks as the imaginary part, 
repeating this step for the selected one of the multiplicity 
of other images to generate their frequency domain represen- 
tations , multiplying said frequency domain representations to 
form new, symmetrical blocks of data, merging the blocks of 

15 the above step to form a single block of complex data, 
performing a set of inverse DFT's along the rows of the block 
resulting from the above step, splitting the resultant block 
from the preceding step into two blocks by taking the real 
and imaginary parts of said resultant block, and searching 

20 the resulting blocks for their largest value to provide a 
result; 

(c) repeating step (b) for at least a plurality of 
other ones of said multiplicity of other images; and 

(d) comparing said results of repeated step (b) , 
25 selecting the one of the other images that best matches said 

specified image, and transmitting a signal corresponding 
thereto. 
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